Missing Fairness

Analysis of fairness metrics with missing data (work in progress)

David Nieves-Cordones https://github.com/DNC87 (Technical University of Valencia), Fernando Martinez-Plumed https://nandomp.github.io/ (Technical University of Valencia)
January 7, 2019

Datasets

Adult Census Income

Description:

The prediction task is to determine whether a person makes over 50K a year. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))
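The extraction conditions quoted above amount to a row filter. A minimal Python sketch follows; the dictionary keys mirror the variable names used in the condition, and the sample records are invented for illustration, not taken from the census data.

```python
def is_clean_record(rec):
    """True when a record satisfies
    ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))."""
    return (
        rec["AAGE"] > 16
        and rec["AGI"] > 100
        and rec["AFNLWGT"] > 1
        and rec["HRSWK"] > 0
    )

# Hypothetical records, for illustration only.
records = [
    {"AAGE": 39, "AGI": 77516, "AFNLWGT": 2174, "HRSWK": 40},
    {"AAGE": 15, "AGI": 5000, "AFNLWGT": 300, "HRSWK": 10},  # fails AAGE > 16
    {"AAGE": 52, "AGI": 20000, "AFNLWGT": 150, "HRSWK": 0},  # fails HRSWK > 0
]

clean = [r for r in records if is_clean_record(r)]
print(len(clean), "of", len(records), "records kept")
```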

Cite:

Ron Kohavi, “Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid”, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996

Fairness Analysis:

For protected attribute sex, Male is privileged, and Female is unprivileged. For protected attribute race, White is privileged, and Non-white is unprivileged. Favorable label is High income (> 50K) and unfavorable label is Low income (<= 50K).
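To make the roles of the privileged group and the favorable label concrete, the sketch below computes two dataset-level metrics with the definitions used by AIF360: statistical parity difference and disparate impact. This report does not list which metrics it uses, so the choice of these two is an assumption, and the toy rows are invented rather than drawn from the Adult data.

```python
def favorable_rate(rows, attr, group, label_key, favorable):
    """Share of rows in `group` (by `attr`) that carry the favorable label."""
    members = [r for r in rows if r[attr] == group]
    if not members:
        raise ValueError(f"no rows with {attr} == {group!r}")
    return sum(r[label_key] == favorable for r in members) / len(members)

def dataset_metrics(rows, attr, privileged, unprivileged, label_key, favorable):
    """Statistical parity difference and disparate impact for one attribute."""
    p_priv = favorable_rate(rows, attr, privileged, label_key, favorable)
    p_unpriv = favorable_rate(rows, attr, unprivileged, label_key, favorable)
    return {
        # unprivileged rate minus privileged rate; 0 means parity
        "statistical_parity_difference": p_unpriv - p_priv,
        # unprivileged rate over privileged rate; 1 means parity
        # (raises ZeroDivisionError if the privileged rate is 0)
        "disparate_impact": p_unpriv / p_priv,
    }

# Hypothetical toy data: 3 of 5 males and 1 of 5 females have the favorable label.
rows = (
    [{"sex": "Male", "income": ">50K"}] * 3
    + [{"sex": "Male", "income": "<=50K"}] * 2
    + [{"sex": "Female", "income": ">50K"}] * 1
    + [{"sex": "Female", "income": "<=50K"}] * 4
)

metrics = dataset_metrics(rows, "sex", "Male", "Female", "income", ">50K")
print(metrics)
```

With these definitions, a negative difference or a ratio below 1 means the unprivileged group receives the favorable label less often than the privileged group.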

Missing Values:


 Variables sorted by number of missings: 
        Variable Proportion
      occupation 0.05660146
       workclass 0.05638647
  native.country 0.01790486
             age 0.00000000
          fnlwgt 0.00000000
       education 0.00000000
   education.num 0.00000000
  marital.status 0.00000000
    relationship 0.00000000
            race 0.00000000
             sex 0.00000000
    capital.gain 0.00000000
    capital.loss 0.00000000
  hours.per.week 0.00000000
 income.per.year 0.00000000

 Missings in variables:
       Variable Count
      workclass  1836
     occupation  1843
 native.country   583
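The two listings report the same information in two forms: the first gives the proportion of missing values per variable, the second the absolute count (for example, 1843 missing occupation values at a proportion of 0.05660146 implies about 32,561 rows). A small sketch of how both can be derived is given below; the set of markers treated as missing (None and "?") and the toy rows are assumptions made for illustration.

```python
def missing_summary(rows, missing_markers=(None, "?")):
    """Return {variable: (count, proportion)}, sorted by descending count."""
    n = len(rows)
    if n == 0:
        return {}
    counts = {}
    for row in rows:
        for var, value in row.items():
            # True counts as 1, False as 0
            counts[var] = counts.get(var, 0) + (value in missing_markers)
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {var: (count, count / n) for var, count in ranked}

# Hypothetical rows, for illustration only.
rows = [
    {"age": 39, "workclass": "Private", "occupation": "Sales"},
    {"age": 50, "workclass": "?", "occupation": None},
    {"age": 38, "workclass": "Private", "occupation": "?"},
    {"age": 53, "workclass": "Private", "occupation": "Tech-support"},
]

summary = missing_summary(rows)
for var, (count, proportion) in summary.items():
    print(f"{var:>12} {count:>3} {proportion:.4f}")
```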

Titanic

Description:

The Kaggle Titanic dataset describes the survival status of individual passengers on the Titanic. The data contain no information about the crew, but they do contain ages for about 80% of the passengers (age is missing for 263 passengers, roughly 20% of the records). The principal source for data about Titanic passengers is the Encyclopedia Titanica. The datasets used here were begun by a variety of researchers. One of the original sources is Eaton & Haas (1994), Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay. For more information about how this dataset was constructed, see http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt

Cite:

http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html

Fairness Analysis:

For protected attribute sex, Female is privileged, and Male is unprivileged. For protected attribute pclass (proxy for socio-economic class), 1st class is privileged, and 2nd and 3rd class are unprivileged. Favorable label is survived (survived = TRUE) and unfavorable label is did not survive (survived = FALSE).

Missing Values:


 Variables sorted by number of missings: 
 Variable   Proportion
      age 0.2009167303
 embarked 0.0015278839
     fare 0.0007639419
   pclass 0.0000000000
 survived 0.0000000000
     name 0.0000000000
      sex 0.0000000000
    sibsp 0.0000000000
    parch 0.0000000000
   ticket 0.0000000000

 Missings in variables:
 Variable Count
      age   263
     fare     1
 embarked     2

Irish

Description:

Data on educational transitions for a sample of 500 Irish schoolchildren aged 11 in 1967. The data were collected by Greaney and Kellaghan (1984), and reanalyzed by Raftery and Hout (1985, 1993).

Cite:

http://lib.stat.cmu.edu/datasets/irish.ed

Fairness Analysis:

For protected attribute sex, Male is privileged, and Female is unprivileged (Irish_1 version). For protected attribute sex, Female is privileged, and Male is unprivileged (Irish_2 version). In both versions, favorable label is Leaving Certificate taken (1) and unfavorable label is Leaving Certificate not taken (2).

Missing Values:


 Variables sorted by number of missings: 
            Variable Proportion
      Prestige_score      0.052
   Educational_level      0.012
                 Sex      0.000
                DVRT      0.000
 Leaving_Certificate      0.000
         Type_school      0.000

 Missings in variables:
          Variable Count
 Educational_level     6
    Prestige_score    26

Imputation Methods

Phases

The analysis follows the fairness pipeline of AIF360. An example instantiation of this generic pipeline consists of loading data into a dataset object, transforming it into a fairer dataset using a fair pre-processing algorithm, learning a classifier from this transformed dataset, and obtaining predictions from this classifier. Metrics can be calculated on the original, transformed, and predicted datasets as well as between the transformed and predicted datasets. Many other instantiations are also possible.
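The stages of this pipeline can be sketched end to end in plain Python. The sketch does not use the AIF360 API; it uses reweighing (Kamiran and Calders), one of the pre-processing algorithms AIF360 provides, as a stand-in because this report does not name the technique applied. The toy data, the one-feature classifier and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical rows: (group, feature, label); label 1 is the favorable outcome.
DATA = [
    ("priv", "hi", 1), ("priv", "hi", 1), ("priv", "hi", 1), ("priv", "lo", 0),
    ("unpriv", "hi", 1), ("unpriv", "lo", 0), ("unpriv", "lo", 0), ("unpriv", "lo", 0),
]

def reweigh(rows):
    """Pre-processing: weight each row by P(group) * P(label) / P(group, label)."""
    n = len(rows)
    by_group = Counter(g for g, _, _ in rows)
    by_label = Counter(y for _, _, y in rows)
    by_both = Counter((g, y) for g, _, y in rows)
    return [by_group[g] * by_label[y] / (n * by_both[(g, y)]) for g, _, y in rows]

def parity_difference(groups, labels, weights):
    """Weighted favorable rate of the unprivileged group minus the privileged one."""
    def rate(group):
        total = sum(w for g, w in zip(groups, weights) if g == group)
        favorable = sum(w * y for g, y, w in zip(groups, labels, weights) if g == group)
        return favorable / total
    return rate("unpriv") - rate("priv")

def train_stump(rows, weights):
    """Toy classifier: predict 1 for a feature value whose weighted
    favorable share in the training data is at least 0.5."""
    total, favorable = Counter(), Counter()
    for (_, x, y), w in zip(rows, weights):
        total[x] += w
        favorable[x] += w * y
    return {x: int(favorable[x] / total[x] >= 0.5) for x in total}

groups = [g for g, _, _ in DATA]
labels = [y for _, _, y in DATA]
uniform = [1.0] * len(DATA)

weights = reweigh(DATA)                        # transformed dataset
model = train_stump(DATA, weights)             # classifier learned from it
predictions = [model[x] for _, x, _ in DATA]   # predicted dataset

spd_original = parity_difference(groups, labels, uniform)
spd_transformed = parity_difference(groups, labels, weights)
spd_predicted = parity_difference(groups, predictions, uniform)
print(spd_original, spd_transformed, spd_predicted)
```

Reweighing removes the weighted disparity in the training labels by construction, which is why the metric is recomputed separately on the predictions: a fair transformed dataset does not by itself guarantee fair predictions.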

Figure 1: Figure from https://github.com/IBM/AIF360

Preprocessing: Dataset Fairness Metrics

Metrics:

Datasets without fair pre-processing

Results:

Analysis:

We treat the “Cols” dataset (obtained by removing every column that contains missing values) as the gold standard against which the results from the other imputed datasets are compared.
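The comparison above presupposes several ways of handling missing values. The sketch below shows column deletion (the “Cols” baseline) next to two common alternatives, row deletion and mode imputation. Only “Cols” is named in this report; the other two strategies, the function names and the toy rows are assumptions made for illustration.

```python
from collections import Counter

def drop_columns(rows):
    """'Cols': remove every column that has at least one missing value."""
    incomplete = {k for r in rows for k, v in r.items() if v is None}
    return [{k: v for k, v in r.items() if k not in incomplete} for r in rows]

def drop_rows(rows):
    """Keep only rows without any missing value."""
    return [dict(r) for r in rows if all(v is not None for v in r.values())]

def impute_mode(rows):
    """Replace each missing value with the most frequent observed value
    of its column (columns with no observed value are left unchanged)."""
    modes = {}
    for key in rows[0]:
        observed = [r[key] for r in rows if r[key] is not None]
        if observed:
            modes[key] = Counter(observed).most_common(1)[0][0]
    return [
        {k: (modes.get(k) if v is None else v) for k, v in r.items()}
        for r in rows
    ]

# Hypothetical rows; None marks a missing value.
rows = [
    {"sex": "Female", "occupation": "Sales", "income": 1},
    {"sex": "Male", "occupation": None, "income": 0},
    {"sex": "Male", "occupation": "Sales", "income": 1},
    {"sex": "Female", "occupation": "Tech", "income": 0},
]

cols = drop_columns(rows)
complete = drop_rows(rows)
imputed = impute_mode(rows)
print(sorted(cols[0]), len(complete), imputed[1]["occupation"])
```

In the three datasets described earlier, the protected attributes and the label have no missing values, so column deletion leaves them intact.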

Datasets with fair pre-processing

Technique:

Results:

Analysis:

We treat the “Cols” dataset (obtained by removing every column that contains missing values) as the gold standard against which the results from the other imputed datasets are compared.

Inprocessing: (Fairness-aware) Model Metrics

Metrics:

Techniques:

Fairness-aware Techniques:

Classifiers without bias mitigation (no preprocessing)

Results:

Classifiers without bias mitigation (with preprocessing)

Results:

Classifiers with bias mitigation (no preprocessing)

Results:

Pareto Plots

Figure 2: Adult dataset

Figure 3: Titanic dataset

Figure 4: Irish_1 dataset

Figure 5: Irish_2 dataset

Figure 6: Recidivism dataset (1 seed)

Figure 7: Violent Recidivism dataset (1 seed)
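A Pareto plot highlights the configurations that are not dominated by any other, that is, those for which no alternative is at least as good on every criterion and strictly better on one. The axes of the figures are not recoverable from this text, so the sketch below assumes two criteria, accuracy (higher is better) and an unfairness score such as the absolute statistical parity difference (lower is better). The result tuples are invented for illustration.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is (name, accuracy, unfairness): higher accuracy is better,
    lower unfairness is better.
    """
    front = []
    for name, acc, unf in points:
        dominated = any(
            a >= acc and u <= unf and (a > acc or u < unf)
            for _, a, u in points
        )
        if not dominated:
            front.append((name, acc, unf))
    return front

# Hypothetical results, for illustration only.
results = [
    ("A", 0.85, 0.20),
    ("B", 0.83, 0.10),
    ("C", 0.80, 0.15),  # dominated by B: lower accuracy and higher unfairness
    ("D", 0.78, 0.05),
]

front = pareto_front(results)
print([name for name, _, _ in front])
```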

Postprocessing

Citation

For attribution, please cite this work as

Nieves-Cordones & Martinez-Plumed (2019, Jan. 7). Missing Fairness. Retrieved from https://nandomp.github.io/R/missingFairness.html

BibTeX citation

@misc{nieves2019missingfairness,
  author = {Nieves-Cordones, David and Martinez-Plumed, Fernando},
  title = {Missing Fairness},
  url = {https://nandomp.github.io/R/missingFairness.html},
  year = {2019}
}